Goto

Collaborating Authors

 violent completion


Muslim-Violence Bias Persists in Debiased GPT Models

Hemmatian, Babak, Baltaji, Razan, Varshney, Lav R.

arXiv.org Artificial Intelligence

Abid et al. (2021) showed a tendency in GPT-3 to generate mostly violent completions when prompted about Muslims, compared with other religions. Two pre-registered replication attempts found few violent completions and only a weak anti-Muslim bias in the more recent InstructGPT, fine-tuned to eliminate biased and toxic outputs. However, more pre-registered experiments showed that using common names associated with the religions in prompts increases several-fold the rate of violent completions, revealing a significant second-order anti-Muslim bias. ChatGPT showed a bias many times stronger regardless of prompt format, suggesting that the effects of debiasing were reduced with continued model development. Our content analysis revealed religion-specific themes containing offensive stereotypes across all experiments. Our results show the need for continual de-biasing of models in ways that address both explicit and higher-order associations.


Debiased Large Language Models Still Associate Muslims with Uniquely Violent Acts

Hemmatian, Babak, Varshney, Lav R.

arXiv.org Artificial Intelligence

Recent work demonstrates a bias in the GPT-3 model towards generating violent text completions when prompted about Muslims, compared with Christians and Hindus. Two pre-registered replication attempts, one exact and one approximate, found only the weakest bias in the more recent Instruct Series version of GPT-3, fine-tuned to eliminate biased and toxic outputs. Few violent completions were observed. Additional pre-registered experiments, however, showed that using common names associated with the religions in prompts yields a highly significant increase in violent completions, also revealing a stronger second-order bias against Muslims. Names of Muslim celebrities from non-violent domains resulted in relatively fewer violent completions, suggesting that access to individualized information can steer the model away from using stereotypes. Nonetheless, content analysis revealed religion-specific violent themes containing highly offensive ideas regardless of prompt format. Our results show the need for additional debiasing of large language models to address higher-order schemas and associations.